Conversation
pombredanne
left a comment
There was a problem hiding this comment.
@ziadhany Thank you ++ do you mind to add a few tests?
Also may be rename this module to pysec.py?
pombredanne
left a comment
There was a problem hiding this comment.
BTW you may want to amend your commit message in 2fb225d
"Add PyPL OSV" could be something like:
Add PyPI OSV importer
Reference: https://github.com/nexB/vulnerablecode/issues/607
Signed-off-by: Ziad <ziadhany2016@gmail.com>
and you could squash your commits too
3598d65 to
cf3d432
Compare
TG1999
left a comment
There was a problem hiding this comment.
Thank You for your contribution, please check for items before accessing them, if they are not there log it, please check alpine importer for reference. I have added some comments for your consideration.
|
@ziadhany Please rebase your branch as well :) |
006b85b to
6a82e35
Compare
Hritik14
left a comment
There was a problem hiding this comment.
Thank you for researching on OSV! There are a few points I'd like you to consider
pombredanne
left a comment
There was a problem hiding this comment.
Thank you ++ for the updates!
I added some comments for your kind consideration.
vulnerabilities/importers/pysec.py
Outdated
| f"fixed_version InvalidVersion - PypiVersion - {raw_data['id'] !r}" | ||
| ) | ||
|
|
||
| if fix_range["type"] == "SEMVER": |
There was a problem hiding this comment.
How would it be possible for a PyPI range to use semver?
Also this code is too complicated for me to understand as is the same code above... can you find a way to decompose this in something simpler? Also it is duplicated almost exactly with the code above... so creating a function would make sense to avoid duplicated code.
There was a problem hiding this comment.
ex: GHSA-2ghc-6v89-pw9j.json
https://osv.dev/vulnerability/GHSA-2ghc-6v89-pw9j
There was a problem hiding this comment.
@ziadhany Thanks ... so in https://osv.dev/vulnerability/GHSA-2ghc-6v89-pw9j I see three different affected packages:
-
npm/body-parser-xmlwhich would become the purlpkg:npm/body-parser-xmland is reported as SEMVER versioning which would become thenpmvers scheme on our side. This is found at https://www.npmjs.com/package/body-parser-xml -
NuGet/body-parser-xml-nugetwhich would become the purlpkg:nuget/body-parser-xml-nugetand is reported as ECOSYSTEM versioning and would become thenugetvers scheme on our side... BUT this does NOT exist as a nuget package at all. https://www.nuget.org/packages?q=body-parser-xml ... not even close -
PyPI/body-parser-xml-pipwhich would become the purlpkg:nuget/body-parser-xml-pipand is reported as ECOSYSTEM versioning and would become thepypivers scheme on our side. BUT this does NOT exist as a pypi package at all: https://pypi.org/search/?q=body-parser-xml
So what to make of this? OSV and GitHub data are likely not really reviewed after all. That's not the first time I see junk ghost records like this that are supposedly reviewed.
In anycase this are not a Pypi package... but it is going to be hard to figure this out ... at least not at import time.
I note that this is not seen at all in https://github.com/pypa/advisory-database
So practically:
- I entered Add improver to validate that collected purl are real. #644 to add improvers to validate purls
- short term we could flag these with lower confidence but I am not sure we have enough at import time
- so we can import these ... each should use their own specific vers scheme though and PyPI is NOT semver alright
There is not much harm done to import non-existing purl BTW because these would never be found in the wild.
TG1999
left a comment
There was a problem hiding this comment.
Thanks Ziad, for your changes, I have added some comments for your consideration
|
@ziadhany there is a field called "database_specific" in test data, do see if we can derive some severities from there |
bf016a2 to
19d993a
Compare
df196c6 to
8766874
Compare
pombredanne
left a comment
There was a problem hiding this comment.
Looking great!
I am nitpicking on a few extra things because I am picky!, but the code overall looks nice as it is.
Some general remarks:
- create variables early rather than looking up a dict over and over
- a small docstring to explain what a function does is not mandatory but useful for no-trivial, non-obvious functions.
vulnerabilities/tests/test_pysec.py
Outdated
| } | ||
| ) == [ | ||
| VulnerabilitySeverity( | ||
| system=SCORING_SYSTEMS["cvssv3_vector"], |
There was a problem hiding this comment.
Are you sure this may not be CVSS:3.1 and cvssv3.1_vector instead? https://github.com/nexB/vulnerablecode/blob/47f6ae62d919fff5094b960c1f56bc0c6b8b4be3/vulnerabilities/severity_systems.py#L83
There was a problem hiding this comment.
No, I'm not sure.
but have a look at osv-schema:
https://ossf.github.io/osv-schema/#severitytype-field
Severity Type: CVSS_V3
Score Description : A CVSS vector string ...
ex GHSA-xc3p-ff3m-f46v.json:
"score": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N"
vulnerabilities/importers/pysec.py
Outdated
| return [] | ||
|
|
||
|
|
||
| def get_aff_purl(affected_pkg, raw_id): |
There was a problem hiding this comment.
What about this? Spelling out whole words is fine
| def get_aff_purl(affected_pkg, raw_id): | |
| def get_affected_purl(affected_pkg, raw_id): | |
| """ | |
| Return an affected PackageURL given an ``affected_pkg`` mapping and a ``raw_id`` | |
| """ |
BTW, why do you need a raw_id there?
There was a problem hiding this comment.
just for logs (to track the root cause of errors and fix them easily)
vulnerabilities/importers/pysec.py
Outdated
| logger.error(f"purl affected_pkg not found - {raw_id !r}") | ||
|
|
||
|
|
||
| def get_aff_version_range(affected_pkg, raw_id): |
There was a problem hiding this comment.
Same comment as above...
| def get_aff_version_range(affected_pkg, raw_id): | |
| def get_affected_version_range(affected_pkg, raw_id): |
vulnerabilities/importers/pysec.py
Outdated
|
|
||
|
|
||
| def get_aff_version_range(affected_pkg, raw_id): | ||
| if "versions" in affected_pkg: |
There was a problem hiding this comment.
What about something like this instead?
affected_versions = affected_pkg.get("versions")
if affected_versions:
try:
return PypiVersionRange(affected_versions)
vulnerabilities/importers/pysec.py
Outdated
| if "versions" in affected_pkg: | ||
| try: | ||
| return PypiVersionRange(affected_pkg["versions"]) | ||
| except Exception as e: |
There was a problem hiding this comment.
What could be the exceptions raised here?
There was a problem hiding this comment.
I think InvalidVersionRange
vulnerabilities/importers/pysec.py
Outdated
| if "type" in fixed_range: | ||
| list_fixed = fixed_filter(fixed_range) | ||
| for i in list_fixed: | ||
| if fixed_range["type"] == "ECOSYSTEM": |
There was a problem hiding this comment.
Always prefer to create a named variable in such case:
| if fixed_range["type"] == "ECOSYSTEM": | |
| fixed_range_type = fixed_range["type"] | |
| if fixed_range_type == "ECOSYSTEM": |
pombredanne
left a comment
There was a problem hiding this comment.
See a few final nit pickings for your kind consideration.
As a general note, think about creating a variable early and then using this rather than doing multiple lookups in a mapping: this is more readable and more robust.
vulnerabilities/helpers.py
Outdated
|
|
||
| def dedupe(original: List) -> List: | ||
| """ | ||
| Remove all duplicate items and return a new list |
There was a problem hiding this comment.
| Remove all duplicate items and return a new list | |
| Remove all duplicate items and return a new list preserving ordering |
vulnerabilities/importers/pysec.py
Outdated
| if "affected" in raw_data: | ||
| for affected_pkg in raw_data["affected"]: | ||
| purl = get_affected_purl(affected_pkg, raw_data["id"]) | ||
| if purl.type == "pypi": |
There was a problem hiding this comment.
| if purl.type == "pypi": | |
| if purl.type != "pypi": | |
| logger.error(f"Non PyPI package found in PYSEC advisories: {purl} - from: {raw_data!r}") | |
| else: |
vulnerabilities/importers/pysec.py
Outdated
| for affected_pkg in raw_data["affected"]: | ||
| purl = get_affected_purl(affected_pkg, raw_data["id"]) | ||
| if purl.type == "pypi": | ||
| affected_version_range = get_affected_version_range(affected_pkg, raw_data["id"]) |
There was a problem hiding this comment.
Try to give a name early to avoid multiple lookups and carrying around less common names:
| affected_version_range = get_affected_version_range(affected_pkg, raw_data["id"]) | |
| vulnid = raw_data["id"] | |
| affected_version_range = get_affected_version_range(affected_pkg, vulnid) |
vulnerabilities/importers/pysec.py
Outdated
| if purl.type == "pypi": | ||
| affected_version_range = get_affected_version_range(affected_pkg, raw_data["id"]) | ||
| for fixed_range in affected_pkg.get("ranges", []): | ||
| fixed_version = get_fixed_version(fixed_range, raw_data["id"]) |
There was a problem hiding this comment.
| fixed_version = get_fixed_version(fixed_range, raw_data["id"]) | |
| fixed_version = get_fixed_version(fixed_range, vulnid) |
vulnerabilities/importers/pysec.py
Outdated
|
|
||
|
|
||
| def get_references(raw_data, severity) -> []: | ||
| if "references" in raw_data: |
There was a problem hiding this comment.
You replace the whole function body by:
| if "references" in raw_data: | |
| references = raw_data.get("references") or [] | |
| return [Reference(url=ref["url"], severities=severity) for ref in references if ref] |
Also what is severity here? a list since it afterward assigned to severities?
how does it apply to the whole list of references and not just to a single one?
Do you have an example of data?
There was a problem hiding this comment.
reference field (url) and a severity field are separated from each other in osv-schema. I think we can make use of ADVISORY type field in reference and assign the severities to it but What if there is two ADVISORY ?
{
"database_specific": {
"severity": "MODERATE",
"github_reviewed": true,
"cwe_ids": []
},
"references": [
{
"type": "ADVISORY",
"url": "https://nvd.nist.gov/vuln/detail/CVE-2011-1950"
},
{
"type": "WEB",
"url": "https://exchange.xforce.ibmcloud.com/vulnerabilities/67695"
},
{
"type": "ADVISORY",
"url": "https://github.com/advisories/GHSA-2qx8-589j-gcpx"
},
...
]
}
{
"database_specific": {
"severity": "CRITICAL",
}
"severity": [
{
"type": "CVSS_V3",
"score": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
}],
"references": [
{
"type": "ADVISORY",
"url": "https://nvd.nist.gov/vuln/detail/CVE-2022-28346"
},
{
"type": "WEB",
"url": "https://github.com/django/django/commit/93cae5cb2f9a4ef1514cf1a41f714fef08005200"
},
{
"type": "WEB",
"url": "https://lists.debian.org/debian-lts-announce/2022/04/msg00013.html"
},
{
"type": "PACKAGE",
"url": "https://github.com/django/django"
}
]
]
...}
Any suggestions?
vulnerabilities/importers/pysec.py
Outdated
| except ValueError: | ||
| logger.error(f"PackageURL ValueError - {raw_id !r} - purl: {package['purl'] !r}") | ||
|
|
||
| if "ecosystem" in package and "name" in package: |
There was a problem hiding this comment.
Same nit as above: you are doing two lookups in package: here and below on line 184.
You should collect ecosystem and name in variables once instead
vulnerabilities/importers/pysec.py
Outdated
| if "type" in fixed_range: | ||
| list_fixed = fixed_filter(fixed_range) | ||
| for i in list_fixed: | ||
| fixed_range_type = fixed_range["type"] |
There was a problem hiding this comment.
Should this be outside of the loop rather?
vulnerabilities/importers/pysec.py
Outdated
| if fixed_range_type == "GIT": | ||
| # TODO add GitHubVersion univers fix_version | ||
| logger.error(f"NotImplementedError GIT Version - {raw_id !r} - {i !r}") | ||
| else: |
There was a problem hiding this comment.
You should put error checks first at the top and invert this if/else block and if there no "type" you should return right way.
This will make the code simpler and remove one indentation level
vulnerabilities/importers/pysec.py
Outdated
| >>> fixed_filter({"type": "ECOSYSTEM","events":[{"introduced": "0"},{"fixed": "1.0.0"},{"fixed": "9.0.0"}]}) | ||
| ['1.0.0', '9.0.0'] | ||
| """ | ||
| filter_fixed = list(filter(lambda x: x.keys() == {"fixed"}, fixed_range["events"])) |
There was a problem hiding this comment.
What about this instead? It may be easier to read than a filter with a lambda?
for event in fixed_range.get("events") or []:
fixed = event.get("fixed")
if fixed:
yield fixed
vulnerabilities/importers/pysec.py
Outdated
| if "published" in raw_data: | ||
| return parse(raw_data["published"]) | ||
| else: | ||
| logger.warning(f"date_published not found {raw_data['id'] !r}") |
There was a problem hiding this comment.
I am not sure the date is important enough to warrant logging
vulnerabilities/importers/pysec.py
Outdated
| def get_severity(raw_data) -> []: | ||
| severities = [] | ||
| if "severity" in raw_data: | ||
| for sever_list in raw_data["severity"]: |
There was a problem hiding this comment.
| for sever_list in raw_data["severity"]: | |
| for sever_list in raw_data.get("severity") or []: |
vulnerabilities/importers/pysec.py
Outdated
| logger.warning(f"date_published not found {raw_data['id'] !r}") | ||
|
|
||
|
|
||
| def get_severity(raw_data) -> []: |
There was a problem hiding this comment.
get_severity --> get_severities since there are more than one
vulnerabilities/importers/pysec.py
Outdated
|
|
||
|
|
||
| def get_severity(raw_data) -> []: | ||
| severities = [] |
There was a problem hiding this comment.
IMHO you should rather yield as you go instead of accumulating in a list. The code will be leaner then
63ca14e to
a197a84
Compare
pombredanne
left a comment
There was a problem hiding this comment.
Thank you for the updates. A very last round few nitpickings for your consideration!
vulnerabilities/importers/pysec.py
Outdated
|
|
||
|
|
||
| def get_severities(raw_data) -> []: | ||
| if "severity" in raw_data: |
There was a problem hiding this comment.
This is no longer needed since the code below (for sever_list in raw_data.get("severity") or []: handles all the possible cases:
| if "severity" in raw_data: |
vulnerabilities/importers/pysec.py
Outdated
|
|
||
| affected_packages = [] | ||
| if "affected" not in raw_data: | ||
| logger.error(f"affected_packages not found - {raw_data['id'] !r}") |
There was a problem hiding this comment.
It is always best to return early if there is nothing left to do.
This way you can de-indent the code that comes after
| logger.error(f"affected_packages not found - {raw_data['id'] !r}") | |
| logger.error(f"affected_packages not found - {raw_data['id'] !r}") | |
| return |
vulnerabilities/importers/pysec.py
Outdated
| if "affected" not in raw_data: | ||
| logger.error(f"affected_packages not found - {raw_data['id'] !r}") | ||
|
|
||
| if "affected" in raw_data: |
There was a problem hiding this comment.
Following from above, this becomes not needed:
| if "affected" in raw_data: | |
| if "affected" in raw_data: |
vulnerabilities/importers/pysec.py
Outdated
| logger.error(f"affected_packages not found - {raw_data['id'] !r}") | ||
|
|
||
| if "affected" in raw_data: | ||
| for affected_pkg in raw_data["affected"]: |
There was a problem hiding this comment.
affected may not be a list, so to be safe, I would use this:
| for affected_pkg in raw_data["affected"]: | |
| for affected_pkg in raw_data.get("affected") or []: |
vulnerabilities/importers/pysec.py
Outdated
|
|
||
| if "affected" in raw_data: | ||
| for affected_pkg in raw_data["affected"]: | ||
| purl = get_affected_purl(affected_pkg, raw_data["id"]) |
There was a problem hiding this comment.
Use a raw_id variable defined above
vulnerabilities/importers/pysec.py
Outdated
| logger.error( | ||
| f"Non PyPI package found in PYSEC advisories: {purl} - from: {raw_data['id']!r}" | ||
| ) | ||
| else: |
There was a problem hiding this comment.
Return early and de-indent:
| else: | |
| return |
vulnerabilities/importers/pysec.py
Outdated
| f"Non PyPI package found in PYSEC advisories: {purl} - from: {raw_data['id']!r}" | ||
| ) | ||
| else: | ||
| vulnid = raw_data["id"] |
Reference: aboutcode-org#607 Signed-off-by: Ziad <ziadhany2016@gmail.com> add the necessary changes Signed-off-by: Ziad <ziadhany2016@gmail.com> remove aliases de-duplicate Signed-off-by: Ziad <ziadhany2016@gmail.com> reslove conflicts Signed-off-by: Ziad <ziadhany2016@gmail.com> add dateparser to setup.cfg Signed-off-by: Ziad <ziadhany2016@gmail.com> Resolving conflicts Signed-off-by: Ziad <ziadhany2016@gmail.com> resolve conflicts Signed-off-by: Ziad <ziadhany2016@gmail.com> add a require changes Signed-off-by: Ziad <ziadhany2016@gmail.com> add a require changes Signed-off-by: Ziad <ziadhany2016@gmail.com> add a require changes Signed-off-by: Ziad <ziadhany2016@gmail.com> add a require changes Signed-off-by: Ziad <ziadhany2016@gmail.com> add a require changes Signed-off-by: Ziad <ziadhany2016@gmail.com> add a require changes Signed-off-by: Ziad <ziadhany2016@gmail.com>
pombredanne
left a comment
There was a problem hiding this comment.
LGTM. Merging! Thank you ++ for your peseverance!r
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
| return requests.post(endpoint, headers=headers, json=graphql_query).json() | ||
|
|
||
|
|
||
| def dedupe(original: List) -> List: |
There was a problem hiding this comment.
You could also have implemented this using a OrderedDict from collections.
>>> import collections
>>> collections.OrderedDict.fromkeys(["z","i","a","a","d","d"]).keys()
odict_keys(['z', 'i', 'a', 'd'])
There was a problem hiding this comment.
good point! actually on Python 3.6+, this could be even shorter: dict.fromkeys(["z","i","a","a","d","d"]).keys()
#607
Signed-off-by: Ziad ziadhany2016@gmail.com